(54) COMPOSITE IMAGE DISPLAY SYSTEM

(57)Abstract: 

PURPOSE: To produce a display exactly as the sender intends by embedding, on the
composing side, the parameters that the person composing the data desires into the
composite image data of a face image, and by using the values embedded in the image
data as the initial values of the system on the display side.
CONSTITUTION: Mapping to a face model is performed on the basis of the original of the
face image to be displayed on the reception side, and the parameters of the respective
mouth shapes are used to generate the composite image data, in which parameters
concerning the impression to be given to the reception side are embedded. Document
information is generated separately, and both are sent to the reception side as
transmission data. The data are input to a transmission data input part 8; the document
information is sent to a document decomposition part 1, the composite image data is sent
to an image memory 6, and the various parameters are sent to a parameter input part 7.
On receiving the parameters, the parameter input part 7 checks them and sends them to an
image display control part 5 and the image memory 6. On the reception side, the
information is displayed exactly as the sender intends.
* NOTICES *

JPO and INPIT are not responsible for any
damages caused by the use of this translation.

1. This document has been translated by computer, so the translation may not reflect the
original precisely.

2. **** shows a word which cannot be translated.
3. In the drawings, any words are not translated.



CLAIMS 



[Claim(s)] 

[Claim 1]An image composing display system which generates, from arbitrary text data,
synthesized speech corresponding to the text data and synthetic video of a person's face in
which the mouth moves in accordance with the synthesized speech, the system being
constituted so that, when the composite image of the face is created on the side that creates
the text data, various parameters for deciding the generation mode of the synthesized speech
and the synthetic video on the display side are added to the composite image and passed to
the display side.
[Claim 2]The image composing display system according to claim 1, wherein the various
parameters include the vocal quality of the synthesized speech, and the display magnification
and the display position used when displaying the synthetic video.
[Claim 3]An image composing display system comprising: 

A transmission data input means which divides received transmission data into composite
image data, text data, and various parameters.

A voice synthesis means which generates and outputs synthesized speech based on the text
data.

An image memory which stores the composite image data separated by the transmission data
input means.

A conversion means which converts the text data into a series of mouth form numerals
representing the series of mouth shapes made when the text data are uttered.

A pronunciation time calculation means which calculates, based on the text data, the
pronunciation time of each syllable of the synthesized speech output from the voice synthesis
means, and estimates the timing of the break of each syllable.

A display control means which switches the display image to the mouth form picture
corresponding to the mouth form numeral from the conversion means at the timing of the break
of each syllable estimated by the pronunciation time calculation means.

A parameter input means which sends the various parameters separated by the transmission
data input means to the corresponding internal circuits.



[Translation done.] 
DETAILED DESCRIPTION



[Detailed Description of the Invention] 
[0001] 

[Industrial Application]This invention relates to an image composing display system applicable
to AV (audio-video) E-mail and the like, in which, merely by sending text data, a message can
be conveyed to the other party through synthesized speech and synthetic video of the sender's
face speaking, much as with a TV telephone. In particular, it relates to an image composing
display system which can convey to the display side the impression intended on the side that
creates the text, such as vocal quality and facial expression.
[0002] 

[Description of the Prior Art]The technique of freely generating and uttering synthesized
speech corresponding to arbitrary text information is called rule voice synthesis, and rule
voice synthesizers that realize it are already in production. This rule speech synthesis
technique is applied in various fields in order to improve the interface between humans and
machines. In recent years, as with speech synthesis, techniques have been developed which
analyze arbitrary text data and generate video of a person, including the motion of the mouth
made when the text is spoken. By combining this with the above-mentioned speech synthesis
technique, a more natural interface can be realized.

[0003]For example, suppose this technique for synthesizing sound and face video is applied to
E-mail, with data files such as the sender's face picture prepared beforehand on the receiving
side. Whereas conventional E-mail merely displayed the text on the receiver's screen, a
message with rich expression can then be conveyed to the addressee: video of the sender's
face speaking appears, and the text is read out by synthesized speech.
[0004]An example configuration of a sound and video output unit which synthesizes and outputs
sound and face video based on such text is shown in drawing 4. In drawing 4, 1 is a text
decomposition part into which text information is input. The text decomposition part 1
analyzes the input text data, generates sound control data for voice response, and outputs it
to the rule speech synthesis section 2 and the sound / mouth form converter 3. For example,
when the text "now" is input as text data, it is decomposed into, and output as, phoneme data
consisting of the vowels and consonants "T, A, D, A, I, M, A."
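The decomposition in this example can be illustrated with a minimal sketch. It assumes the
input is already romanized ("tadaima", matching the phoneme data above) and that every letter
is one phoneme; the source does not describe the actual analysis algorithm, and the function
name `decompose_text` is hypothetical.

```python
def decompose_text(romanized: str) -> list:
    """Split a romanized word into upper-case phoneme symbols.

    Assumption: one letter equals one phoneme. This holds for "tadaima"
    but is not a general rule for Japanese text.
    """
    return [ch.upper() for ch in romanized if ch.isalpha()]


phonemes = decompose_text("tadaima")
print(phonemes)  # ['T', 'A', 'D', 'A', 'I', 'M', 'A']
```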
[0005]The rule speech synthesis section 2 is a device which, for arbitrary text, generates and
outputs synthesized speech that reads out the text based on its phoneme data.
[0006]The sound / mouth form converter 3 is a device which converts the phoneme data of
arbitrary text into a series of mouth form numerals expressing the series of mouth motions
made when the text is pronounced. There are seven kinds of mouth form numerals: A (vowel A),
I (vowel I), U (vowel U), E (vowel E), O (vowel O), S (consonant), and C (closed mouth). A
picture of the mouth shape made when pronouncing each of them is prepared beforehand for each
mouth form numeral. When the above-mentioned text "now" is input as text data, the phoneme
data "TADAIMA" of the text is assigned mouth form numerals as follows: "T" -> S, "A" -> A,
"D" -> S, "A" -> A, "I" -> I, "M" -> C, "A" -> A. These are output to the image display
controller 5 as a series of mouth form numerals.
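The assignment of mouth form numerals can be sketched as a simple lookup. This is an
illustration only: the source shows "M" mapping to the closed mouth numeral C and other
consonants to S, but does not list which further consonants count as closed mouth, so the set
below contains only "M". The function name `to_mouth_forms` is hypothetical.

```python
VOWELS = {"A", "I", "U", "E", "O"}
# The source only shows "M" -> C. Other bilabial consonants such as "B" or "P"
# might plausibly belong here too, but that is an assumption and is left out.
CLOSED_MOUTH = {"M"}


def to_mouth_forms(phonemes):
    """Map each phoneme to one of the seven mouth form numerals."""
    forms = []
    for p in phonemes:
        if p in VOWELS:
            forms.append(p)    # a vowel uses its own mouth form
        elif p in CLOSED_MOUTH:
            forms.append("C")  # closed mouth
        else:
            forms.append("S")  # any other consonant
    return forms


print(to_mouth_forms(list("TADAIMA")))  # ['S', 'A', 'S', 'A', 'I', 'C', 'A']
```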

[0007]Composite image data is stored in the image memory 6. This composite image data gathers
into one file the data of a head-and-shoulders picture of the speaker for one frame and seven
kinds of mouth region images, synthesized from that picture, corresponding to the seven kinds
of mouth form numerals described above.
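One way to picture this file is as a record holding one base frame and one mouth region image
per mouth form numeral. The sketch below is an assumption about layout, not the format used in
the source; the class name `CompositeImageData` and its fields are hypothetical, and the image
payloads are placeholder bytes.

```python
from dataclasses import dataclass, field

MOUTH_FORMS = ("A", "I", "U", "E", "O", "S", "C")


@dataclass
class CompositeImageData:
    base_frame: bytes                                  # one head-and-shoulders frame
    mouth_images: dict = field(default_factory=dict)   # numeral -> mouth region image

    def is_complete(self) -> bool:
        """True when exactly the seven mouth region images are present."""
        return set(self.mouth_images) == set(MOUTH_FORMS)


data = CompositeImageData(
    base_frame=b"<frame>",
    mouth_images={form: f"<mouth {form}>".encode() for form in MOUTH_FORMS},
)
print(data.is_complete())  # True
```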

[0008]The pronunciation time calculation part 4 calculates, based on the sound control data
from the text decomposition part 1 and using exactly the same algorithm as the rule speech
synthesis section 2, the time until each syllable is pronounced when the speech is
synthesized. That is, for the case where the input text is uttered as synthesized speech by
the rule speech synthesis section 2, it estimates the timing of the break of each syllable
constituting the text, taking the head of the text as the starting point, and outputs the
result to the image display controller 5.
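Measuring each break from the head of the text amounts to a running sum of syllable durations.
The sketch below assumes the durations are already known; the values are invented for
illustration, since the source does not give the synthesis algorithm or any timings, and
`syllable_breaks` is a hypothetical name.

```python
from itertools import accumulate


def syllable_breaks(durations_ms):
    """Return the time of each syllable break, measured from the head of the text."""
    return list(accumulate(durations_ms))


# Hypothetical durations in milliseconds for the syllables ta, da, i, ma.
durations = [("ta", 150), ("da", 150), ("i", 120), ("ma", 180)]
breaks = syllable_breaks([ms for _, ms in durations])
print(breaks)  # [150, 300, 420, 600]
```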
[0009]The image display controller 5 performs image display control so that, when the
pronouncing timing of each syllable arrives according to the timing signal from the
pronunciation time calculation part 4, the mouth form picture corresponding to the mouth form
numeral of that syllable is selected from the image memory 6 and output. That is, synchronous
control is performed so that the synthesized speech and the face video are synchronized, in
other words so that the motion of the speaker's mouth displayed on the screen matches the
sound uttered by the rule speech synthesis section 2.
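This synchronization can be sketched as a lookup from playback time to the mouth form picture
that should be on screen. The start times below are invented, and assigning one time slot per
mouth form numeral is a simplifying assumption; the source only states that the picture is
switched at the estimated syllable timing. `mouth_form_at` is a hypothetical name.

```python
from bisect import bisect_right

# Hypothetical start time (ms) of each mouth form numeral for "TADAIMA".
STARTS_MS = [0, 50, 150, 200, 300, 420, 480]
FORMS = ["S", "A", "S", "A", "I", "C", "A"]


def mouth_form_at(t_ms, starts_ms, forms):
    """Return the mouth form numeral whose time slot contains t_ms."""
    i = bisect_right(starts_ms, t_ms) - 1  # index of the last slot starting at or before t_ms
    return forms[max(i, 0)]                # clamp times before the first slot


for t in (0, 60, 430):
    print(t, mouth_form_at(t, STARTS_MS, FORMS))
```

A display loop would call such a function with the elapsed playback time and copy the selected
mouth region image to the screen whenever the returned numeral changes.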

[0010]The parameter inputting part 7 is the portion through which various parameters, such as
the vocal quality of the speech synthesized by the rule speech synthesis section 2 and the
display position and display magnification of the face video on the screen, are input using a
keyboard or the like. Parameters concerning the synthesized speech are passed to the rule
speech synthesis section 2, and parameters concerning the face video are passed to the image
display controller 5 and the image memory 6.
[0011]The operation of the device constituted in this way is explained. When text data are
input, they are analyzed by the text decomposition part 1 and converted into phoneme data,
which is passed to the rule speech synthesis section 2 and uttered as synthesized speech. In
parallel with this utterance, the phoneme data is converted into a series of mouth form
numerals by the sound / mouth form converter 3. The pronunciation time calculation part 4
estimates the time of the break of each syllable from the phoneme data and passes this timing
data to the image display controller 5. The image display controller 5 aligns the timing of
the mouth form numerals with the pronouncing timing of each syllable. From among the pictures
for the respective mouth form numerals expanded in the image memory 6, the face image data
corresponding to the mouth form numerals obtained by the sound / mouth form converter 3 is
transferred to VRAM, and the speaker's face video is displayed on the display screen via this
VRAM. In this way the text data is conveyed to the addressee as a message consisting of
synthesized speech that actually utters the text and face video of the speaker whose mouth
motion is timed to that synthesized speech.
[0012]The device of drawing 4 can be realized as a small and economical system by using an
existing small voice synthesis unit for the rule speech synthesis section 2 and a personal
computer or the like for the other portions.
[0013] 

[Problem(s) to be Solved by the Invention]When this sound and face video output unit is
realized on a personal computer, the composite images are generally created beforehand, as
described above, and are switched and displayed according to the input text, in order to
reduce the processing load. When these devices generate synthesized speech and face video,
parameters such as the vocal quality, the display position of the picture on the screen, and
the display magnification take the values set beforehand as initial values in the displaying
system (the values input beforehand through the parameter inputting part 7).

[0014]Thus, with the conventional device, the vocal quality of the synthesized speech and the
generation mode of the face video are set beforehand on the receiving side. As a result, an
unnatural impression is given to the viewer when the person of the face picture and the vocal
quality that have been ******** registered do not suit the message.

[0015]As a typical example, when this device is used for E-mail or the like, the person who
created the composite image and the text data is different from the person who actually
displays the text data and views it with sound and video. In that case the utterance and image
display on the receiving side are not always carried out with the vocal quality and picture
size that the creator of the text data and composite image desires, and as a result an
impression completely different from the sender's intention may be given to the person on the
receiving side.
[0016]That is, in the conventional device, the impression that the person on the display side
receives from the vocal quality, the appearance of the face, and so on when a message is
conveyed with sound and video is decided by the parameters set beforehand on the display side.
The impression intended by the person on the side that created the information could not be
conveyed exactly to the person on the display side.

[0017]This invention has been made in view of this problem. Its purpose is to make it
possible, when synthesized speech and a composite face image are displayed on the display side
based on text data, to produce the display as intended by the person on the side that created
the text.
[0018] 

[Means for Solving the Problem]Drawing 1 is a principle explanatory view concerning this
invention. As one form, this invention provides an image composing display system which
generates, from arbitrary text data, synthesized speech corresponding to the text data and
synthetic video of a person's face in which the mouth moves in accordance with the synthesized
speech, the system being constituted so that, when the composite image of the face is created
on the side that creates the text data, various parameters for deciding the generation mode of
the synthesized speech and the synthetic video on the display side are added to the composite
image and passed to the display side.

[0019]The above-mentioned various parameters can include the vocal quality of the synthesized
speech, and the display magnification and the display position used when displaying the
synthetic video.

[0020]As another form, this invention provides an image composing display system comprising:
a transmission data input means which divides received transmission data into composite image
data, text data, and various parameters; a voice synthesis means which generates and outputs
synthesized speech based on the text data; an image memory which stores the composite image
data separated by the transmission data input means; a conversion means which converts the
text data into a series of mouth form numerals representing the series of mouth shapes made
when the text data are uttered; a pronunciation time calculation means which calculates, based
on the text data, the pronunciation time of each syllable of the synthesized speech output
from the voice synthesis means, and estimates the timing of the break of each syllable; a
display control means which switches the display image to the mouth form picture corresponding
to the mouth form numeral from the conversion means at the timing of the break of each
syllable estimated by the pronunciation time calculation means; and a parameter input means
which sends the various parameters separated by the transmission data input means to the
corresponding internal circuits.
[0021] 

[Function]In the image composing display system of this invention, when the face picture
required for display is synthesized on the transmitting side, the display magnification and
display position of the composite image, the vocal quality of the synthesized speech, and
other parameters that the person who synthesized it wishes to apply on the display side are
embedded in the same data. On the display side, the values embedded in the image data are used
as the initial values of the system. The display system can thereby generate the synthesized
speech and composite image as intended by the person who synthesized the face picture.

[0022]In the image composing display system of the other form of this invention, the
transmission data received by the transmission data input means is separated into composite
image data, text data, and various parameters. The text data is analyzed by a text decomposing
means to generate sound control data, and based on this sound control data synthesized speech
is generated and output by the voice synthesis means. The received composite image data is
stored in the image memory, and the sound control data is converted into a series of mouth
form numerals by the conversion means. Based on the sound control data, the pronunciation time
calculation means calculates the pronunciation time of each syllable uttered by the voice
synthesis means and estimates the timing of the break of each syllable. The image display
control means controls reading of the mouth form picture of each syllable from the image
memory according to the timing signal of that syllable. The received parameters are sent by
the parameter input means to the corresponding internal circuits, so that the generation mode
of the synthesized speech and the composite image becomes what the person on the text creation
side intended.
[0023] 

[Example]Hereafter, an example of this invention is described with reference to the drawings.
Drawing 2 shows a sound and face video output unit based on the image composing display system
as one example of this invention. In drawing 2, the text decomposition part 1, the rule speech
synthesis section 2, the sound / mouth form converter 3, the pronunciation time calculation
part 4, the image display controller 5, and the image memory 6 are the same as those explained
in the above-mentioned conventional example.

[0024]The difference from the conventional device lies in the transmission data sent from the
transmitting side. In addition to the original text data, it contains composite image data,
such as the face picture and mouth form pictures that the person on the transmitting side
expects to be synthesized and displayed on the receiving side, and embedded in this composite
image data are the parameters that person desires, such as the display magnification on the
screen, the display position, the vocal quality, and others.

[0025]The concept of the processing for embedding these parameters in the transmission data on
the transmitting side is shown in drawing 3. Mapping to a face model is performed based on the
original image of the face picture to be displayed on the receiving side, and composite image
data is created using the parameters of each mouth shape. The display magnification, display
position, vocal quality, and other parameters concerning the impression to be given to the
receiving side are embedded in this data. Text data are created separately, and both are sent
to the receiving side as transmission data. Once the composite image data with the embedded
parameters has been sent, it suffices thereafter to send only text data each time.
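As an illustration of embedding, the parameters could be carried in a small header placed in
front of the image bytes. The container layout below (a 4-byte big-endian length, a JSON
header, then the image payload) is purely an assumption for this sketch; the source does not
specify how the parameters are embedded. The names `embed_parameters` and `extract_parameters`
and the parameter keys are hypothetical.

```python
import json
import struct


def embed_parameters(image_bytes: bytes, params: dict) -> bytes:
    """Prefix the image payload with a length-tagged JSON parameter header."""
    header = json.dumps(params, sort_keys=True).encode("utf-8")
    return struct.pack(">I", len(header)) + header + image_bytes


def extract_parameters(blob: bytes):
    """Recover (params, image_bytes) from data built by embed_parameters."""
    (length,) = struct.unpack(">I", blob[:4])
    params = json.loads(blob[4:4 + length].decode("utf-8"))
    return params, blob[4 + length:]


sender_params = {
    "voice_quality": "female_soft",   # hypothetical value
    "display_magnification": 1.5,
    "display_position": [40, 20],
}
blob = embed_parameters(b"<composite image>", sender_params)
params, image = extract_parameters(blob)
print(params == sender_params, image)
```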

[0026]On the receiving side, this transmission data is input to the transmission data input
part 8, where it is separated into text data, composite image data, and various parameters.
The text data is sent to the text decomposition part 1, the composite image data to the image
memory 6, and the various parameters to the parameter inputting part 7.
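The separation step can be sketched as a small dispatcher. The message layout (a dict with
"text", "composite_image", and "parameters" keys) and the class name are assumptions made for
illustration; the three lists merely stand in for the text decomposition part 1, the image
memory 6, and the parameter inputting part 7.

```python
class TransmissionDataInput:
    """Hypothetical stand-in for the transmission data input part 8."""

    def __init__(self):
        self.to_text_decomposition = []  # stands in for part 1
        self.to_image_memory = []        # stands in for image memory 6
        self.to_parameter_input = []     # stands in for part 7

    def receive(self, data: dict) -> None:
        self.to_text_decomposition.append(data["text"])
        # Image data and parameters may be absent when only text is sent.
        if "composite_image" in data:
            self.to_image_memory.append(data["composite_image"])
        if "parameters" in data:
            self.to_parameter_input.append(data["parameters"])


part8 = TransmissionDataInput()
part8.receive({
    "text": "tadaima",
    "composite_image": b"<composite image>",
    "parameters": {"voice_quality": "female_soft"},
})
part8.receive({"text": "tadaima"})  # a later message carrying text only
print(len(part8.to_text_decomposition), len(part8.to_image_memory))  # 2 1
```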
[0027]On receiving these various parameters, the parameter inputting part 7 examines them. It
sends parameters concerning voice synthesis, such as vocal quality, to the rule speech
synthesis section 2, and parameters concerning the picture, such as the display magnification
and display position, to the image display controller 5 and the image memory 6.
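The routing described here can be sketched as sorting parameters by name. The key names and
the function `route_parameters` are hypothetical; the source names only vocal quality, display
magnification, and display position as examples.

```python
VOICE_KEYS = {"voice_quality"}
IMAGE_KEYS = {"display_magnification", "display_position"}


def route_parameters(params: dict):
    """Split parameters into voice-related, image-related, and unrecognized names."""
    voice = {k: v for k, v in params.items() if k in VOICE_KEYS}
    image = {k: v for k, v in params.items() if k in IMAGE_KEYS}
    unknown = sorted(set(params) - VOICE_KEYS - IMAGE_KEYS)
    return voice, image, unknown


voice, image, unknown = route_parameters({
    "voice_quality": "female_soft",
    "display_magnification": 1.5,
    "display_position": (40, 20),
})
print(voice)    # would go to the rule speech synthesis section 2
print(image)    # would go to the image display controller 5 and image memory 6
print(unknown)  # []
```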

[0028]With this configuration, the face picture desired by the person on the transmitting
side, together with the display magnification, display position, vocal quality, and other
parameters, can be used on the receiving side as the face picture to be displayed and, through
the embedded parameters, as the initial values of the system. A message can therefore be
presented on the receiver's display system with sound and pictures as the person on the
transmitting side intended.
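Using embedded values as initial values can be pictured as overriding the receiver's own
defaults. The default values and key names below are invented for illustration, and
`initial_settings` is a hypothetical name.

```python
# Hypothetical defaults that the receiving system would otherwise use.
SYSTEM_DEFAULTS = {
    "voice_quality": "neutral",
    "display_magnification": 1.0,
    "display_position": (0, 0),
}


def initial_settings(embedded: dict) -> dict:
    """Start from the system defaults and let the sender's embedded values override them."""
    settings = dict(SYSTEM_DEFAULTS)
    settings.update({k: v for k, v in embedded.items() if k in SYSTEM_DEFAULTS})
    return settings


settings = initial_settings({"voice_quality": "female_soft", "display_magnification": 1.5})
print(settings)
```

Any parameter the sender leaves out keeps the receiver's default, so a message with no
embedded parameters behaves like the conventional device.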

[0029]Various modifications are possible in carrying out this invention. For example, although
the above example described the case where seven kinds of pictures are used as mouth forms,
this invention is of course not restricted to this, and the number of kinds of mouth form
pictures may be increased in order to synthesize a more natural mouth motion. Also, although
the above example took up the motion of the mouth region as the moving portion of the face
picture synthesized on the display side, the invention is not restricted to this. For example,
if the motion of the eyes and so on is also changed according to the text in addition to the
motion of the mouth, an AV message with richer expression can be sent to the receiving side.

[0030]Although the above example described the case where this invention is applied to AV
E-mail, this invention is not restricted to this and can also be applied to a stand-alone
sound and face video output unit. Alternatively, if recognition of the phonemes of uttered
speech becomes possible in real time, for example through speech recognition technology, it
can also be applied to a pseudo TV phone service in which the expression of the speaker's face
is displayed as video on the addressee's side merely by making an ordinary phone call.
[0031] 

[Effect of the Invention]As explained above, according to this invention, when synthesized
speech and a composite face image are displayed on the receiving side based on text data, a
display as intended by the person who sent the text can be achieved.



[Translation done.] 



DESCRIPTION OF DRAWINGS 



[Brief Description of the Drawings] 

[Drawing 1]It is a principle explanatory view concerning this invention. 

[Drawing 2]It is a figure showing the sound and face video output unit based on the image
composing display system as one example of this invention.

[Drawing 3]It is a figure explaining the concept of the processing on the transmitting side in
the example system.

[Drawing 4]It is a figure showing the conventional sound and video output unit. 
[Description of Notations] 

1 Text decomposition part 

2 Rule speech synthesis section 

3 Sound / mouth form converter

4 Pronunciation time calculation part 

5 Image display controller 

6 Image memory 

7 Parameter inputting part 

8 Transmission data input part 



[Translation done.] 